The RadOrgMiner pipeline: Automated genotyping of organellar loci from RADseq data

Abstract

Reduced complexity or reduced genomic representation library (RRL) approaches made the in-depth study of micro-evolutionary processes feasible giving rise to ecological population genomics (Luikart et al., 2019; Narum 2013). A growing number publications rely on cost-effective SNP discovery available through RRL (Leaché & Oaks, 2017). Arguably, most popular methods nowadays (Holliday 2019) are from group Restriction-site-Associated DNA sequencing (RADseq) (Andrews 2016). Originally, term was meant refer a particular protocol used obtain sequence information about large loci (Baird 2008). The that include restriction enzyme set at genome level can be collectively called RADseq Later, numerous variants protocols, relying type II enzymes sample diversity, have been developed (reviewed by Andrews 2016; Rivera-Colón 2021), with each them having strengths for given purpose being disadvantageous in certain ways providing different horizontal (genome coverage) and vertical coverage (read depth) (Davey 2013; Elshire 2011; Hohenlohe Peterson 2012; Puritz, Matz, 2014; Toonen Typically, whole anonymously (i.e. no priori is known regarding origins genome-wide reads; 2019). Nevertheless, irrespective bioinformatic pipeline used, ascertained typically treated as samples nuclear genome. Only few pipelines take presence organellar RAD tags into consideration fine-tuned haploid data such ipyrad (Eaton Overcast, 2020). However, reference genome-based analysis D'Agostino al. (2018), only 53.6% GBS aligned uniquely closely related These authors hypothesised unmapped sequences might contain high frequency, which partly, but not exclusively, could reduce ratio their study. 
This suggests frequency reads RRLs more significant than previously thought, although limited cut site (Bentley previous studies shown even partially represented obtained provide additional insight genetic composition studied organisms (Meger Stobie Genetic organelles has long utilised phylogenetics phylogeography sources non-recombining, haploid, uniparentally inherited compartments (Avise, 2000, 2004; Soltis Soltis, 1998; Uncu 2015) often show correlated structure geography. Thus, important source phylogeographical analyses until recently (Brito Edwards, 2009; McCormack 2013) owing its one-fourth effective size compared leads rapid lineage sorting (Schaal Olsen, 2000). In addition, ‘cytonuclear discordance’ (Rieseberg 1991) ‘mito-nuclear’ discordance (Funk Omland, 2003; Toews Brelsford, 2012) open window hybridisation via phylogenetic incongruence (Wendel Doyle, 1998) between genes. Comparison datasets within same organism gained some popularity usually uncovered lineages (Barnard-Kubow 2015; Macher Moura Puckett Streicher Sutherland Galloway, 2018; Uckele 2021). these studies, however, dataset regarded ‘representative’ genome, experiments. may also potentially sort out ones (Forsman 2017; Meger Terraneo 2018). this case, provided separating without further effort. utility approach, what we ‘organellar mining’, addressed handful (Clugston Du 2020; Feng Forsman McVay Pujolar All either use existing software—such Stacks (Rochette applied (2019) ipyRAD 2020) (2020) GATK (McKenna 2010) demonstrated (2019)—to genomes, sometimes proprietary solutions (e.g. Geneious Just unique properties (2017) proposed ready-to-use tool designed explicitly assembly genomes using paired-end RADSeq data. Although fast, contrast still highly important. 
Here, introduce custom pipeline, RADOrgMiner, genotype found generated Our command line compatible UNIX-like operating systems allows subsequent comparison coming We demonstrate performance our software solution re-analysing eight publicly (Table 1) span levels divergence. uses tools screen if they align well It separates non-organellar ones, then genotypes (Figure 1). two main steps. first step, all bwa 0.7.17 (Li, separate samtools 1.10.2 (Li 2009). To decrease chimeric resulting data, require both ends reads. As concerted evolution plastid inverted repeats cannot ruled Knox, 2014), alignment originating repeat increase ambiguous alignments read regions mapping quality) inflating missing final dataset. Optional masking one done makes genotyping located reliable assumed identical. Location identified self-blasting blastn 2.10.1+ (Altschul 1997). By self-blasting, mean conducting blast search query against itself. longest matching will configuration. If an present (plastid) second third highest scoring pairs (HSPs) should sequences. length similarity (100% identical). automatic selects HSP masks location subject N-s. At organelle(s) retained haplotype calling, whereas unaligned (non-organellar) saved downstream analyses. minimise amount false (NUPTs) mitochondrial (NUMTs) DNA, interval processed individual locus depth higher any defined minimum value (as exemplified Table 2). Nuclear expected lower relative (Ekblom 2014). Similarly, relatively libraries Clugston When filtering loci, assume purely NUMTs NUPTs corresponding ‘original’ sampled) would ‘true’ loci. copy reach thousand copies (Richly Leister, 2004a; Richly 2004b), effect supposedly number. range ten thousands (Bendich, 1987) per cell depending tissue type. similar over-representation observed mitochondria when energetically active tissue, muscle, isolation Wolf, Consequently, cases, low-copy NUMTs/NUPTs. variable, bearing 2004b). For introduced option filter maximum default million. 
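The depth-based screening described above (keeping, per individual, only intervals whose read depth exceeds a user-defined minimum and stays under an optional maximum that defaults to one million) can be sketched as follows. This is a minimal illustration, not the pipeline's own code: the `Interval` type, the function name `filter_by_depth`, the inclusive bounds and the example depths are all assumptions made for the sketch.

```python
from typing import List, NamedTuple


class Interval(NamedTuple):
    """Hypothetical record for one contiguous locus in one individual."""
    contig: str
    start: int
    end: int
    depth: int  # number of reads aligned to this interval


def filter_by_depth(
    intervals: List[Interval],
    min_depth: int,
    max_depth: int = 1_000_000,  # default maximum taken from the text
) -> List[Interval]:
    """Keep intervals whose depth lies within [min_depth, max_depth].

    Low-depth intervals are more likely to be nuclear copies (NUMTs/NUPTs);
    whether the bounds are inclusive is an assumption of this sketch.
    """
    return [iv for iv in intervals if min_depth <= iv.depth <= max_depth]


# Made-up example data for a single individual.
loci = [
    Interval("plastome", 100, 250, 3),           # too shallow, dropped
    Interval("plastome", 900, 1050, 420),        # kept
    Interval("plastome", 5000, 5150, 2_500_000), # above the maximum, dropped
]
kept = filter_by_depth(loci, min_depth=10)
print([(iv.start, iv.depth) for iv in kept])
```

A suitable minimum depends on the library and the tissue, so in practice it would be chosen after inspecting the depth distribution of each dataset.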
it recommended parameter exclude Setting threshold prior variant calling help falsely narrow likely cytoplasmic origin. call haplotypes freebayes 1.3.2 (Garrison Marth, 2012), intervals created bedtools 2.29.2 (Quinlan Hall, 2010). An defines region continuously overlapping locus. light loci's depth, visualised ‘spikes’ along function advantage approach parallelised drastically run time step. chose easy customisability Bayesian settings consider quality larger 30 bases 20. Minimum base step five probable alleles five, and, low-frequency mismatches calls, 40% total required alternate allele called. constraint best arbitrarily chosen aims running observed. worth highlighting actual nearly identical copy, frequency. least diverged youngest transferred (Michalovova After incorporation increases post-insertion duplication turn, young sharing low sampled library. alternative coupled constraining effectively decreases errors dataset, helps eliminate calls originate NUMTs. polymorphisms nucleus-transferred regions. above changed line, allowing fine-tuning Species, multiple populations analysed species, population-based inference model partitioned groups supplied pipeline. likelihood calculation clumping disabled, priors Hardy–Weinberg equilibrium turned off. Binomial observation off, placement probability, strand balance probability position instead. Since capable ploidy-aware ploidy default. sites annotated, including monomorphic exported vcf file. Missing arising mainly sheared experiments, filtered vcftools 0.1.16 (Danecek 2011) 20% missingness across individuals case narrowed down base-pairs and/or sequenced. Vcf files converted fasta vcf2fasta vcflib 1.0 package 2021) muscle 3.8.1 (Edgar, 2004). 
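Two of the thresholds mentioned above can be illustrated with a short sketch: the requirement that an alternate allele be supported by at least 40% of the observations at a site, and the removal of sites with more than 20% missing data across individuals. The function names, the use of `None` for a missing call and the reading of the 20% value as an upper limit on missingness are assumptions of this sketch; the pipeline itself delegates these steps to freebayes and vcftools.

```python
from typing import List, Optional


def supports_alt(alt_obs: int, total_obs: int, min_fraction: float = 0.40) -> bool:
    """Return True if the alternate allele has enough read support to be called."""
    if total_obs == 0:
        return False  # no observations, nothing to call
    return alt_obs / total_obs >= min_fraction


def missingness(genotypes: List[Optional[str]]) -> float:
    """Fraction of individuals without a call (None) at one site."""
    if not genotypes:
        raise ValueError("at least one individual is required")
    return sum(g is None for g in genotypes) / len(genotypes)


def keep_site(genotypes: List[Optional[str]], max_missing: float = 0.20) -> bool:
    """Keep a site only if its missingness does not exceed max_missing."""
    return missingness(genotypes) <= max_missing


# Made-up haploid calls for ten individuals at two sites.
site_a = ["A"] * 9 + [None]      # 10% missing
site_b = ["A"] * 7 + [None] * 3  # 30% missing
print(keep_site(site_a), keep_site(site_b))
print(supports_alt(5, 10), supports_alt(1, 10))
```

Requiring a large alternate-allele fraction suits haploid organellar data, where a minority allele is more likely to reflect sequencing error or a nuclear copy than a true variant.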
conversion, subset start end coordinates specified way, included loci) calculate statistics, length, polymorphic informative sites, concatenate 100 (bp) AMAS python (Borowiec, Removing partial failed yield shorter length) fragmented prone fragmentation transposable elements tend decay over consequence fragmentation, pseudo-organellar length. recovered proof concept excluding NUMTs, up experiment again fixing assuming pooled benchmark described below. other parameters were left 2. experiment, investigated (the showing AB) evaluated ‘non-haploid’ observations, assessed supported observations) occurs in. Here sequence. Without level, site, expect AB 0 observation. after applying filters remained inflate heterozygosity values. drawback inability differentiate NUMTs/NUPTs heteroplasmy. account individuals, values calculated separately independent observations parameterised reproducible usability. runs conducted Debian 10.1 environment. avoided inclusion boost transparency reproducibility approach. list dependencies, installation instructions, documentation example at: https://github.com/laczkol/RADOrgMiner. power mining representing various focus family-level phylogenies phylogeography) Some screened original authors, those results compare output randomly selected literature represent mitochondrion chloroplast. focused flavours instead scope plant systems, botany molecular variability plastome there much greater availability plastomes mitochondria. Datasets downloaded input references cases Cycadales Porites datasets, assess robustness (Tables 1 Stellaria belong ‘broad’ taxonomic (Sharples Tripp, 2019), consisted fewer samples. inspected increasing genotyped tree reconstruction IQtree 2.0.3 (Minh constant sites) relied showed initial partitioning scheme partition. selection (MFP + MERGE) apply optimal substitution approximate test (aLRT) (Anisimova Gascuel, 2006) branch support 1000 replications. Results interpreted authors' results. statistical aLRT ≥ 80%. 
trees, proportion assemblies R 3.6.3 (R Core Team, ggplot 2 3.3.4 (Wickham, 2016) ggtree 2.0.4 (Yu 2017), edited Inkscape 0.92 (https://inkscape.org/) improve readability. contained tags. seemed variable adjustments gain Xylosandrus yielded setting pipeline). checking distribution samples, non-overlapping noticeable. analyse increased 50% analysis. decreased 198 76. resulted scattered Melicope dataset; thus, 10%. (ezRAD) had lowest coverage. (ddRAD) yielding coverage, Paragorgia (sdRAD) covered mitochondrion. Despite this, Labeobarbus datasets. Helianthemum (GBS) discovered dataset—that linked star activity 2f), 5′ 3′ recognised SbfI—yielded RADseq-derived suitable describe re-analysis here. detailed re-analysed reconstructed please consult Supporting Information placed natalensis distinct clade 3 Figure S5). L. aeneus kimberleyensis sister bore kimberleyensis. Technical replicates difference (La005, LnBL004) counting equal regardless used; 4 S12) M. mountperriensis based longer covering 64.26% genus Cycas separated distance. Within ingroup, Dioon mejiae diverge earliest. tribe Encephalarateae rest family Stangeriaceae Ceratozamieae appeared mixed. Bowenia spectabilis earliest Ceratozamia kuesteriana Stangeria eriopus clustered Zamieae (Microcycas calocoma Zamia integrifolia). S15). Relative values, occasionally allele. > occurred mostly rarely never dominated Removal positions did change outcome reconstructions (results shown). introduces RADOrgMiner specifically Even though 2), consistent according taxonomy together levels). Unlike mine specialised thus automatised task. tailored assemble mention few, (organellar) reconstruct constrained. dDocent (Puritz, Hollenbeck, 2014) specifying like haplotypes. de novo reference-based expects diploid wide SNPs. Still, (2019), haplotypes, need manual curation. 
abovementioned tools, features tags—such possible NUPTs—into statistics short informed decision made, eliminating curation reliability mined great variety resolution (Figures 4; Figures S1–S14) adequately supplement studies' findings corroborating large-scale picture bringing evidence hybridisation. revealed S15) Given similarly 1, heteroplasmy, error chimerisation regard removing effectively. Below, briefly evaluate re-analyses published reflect limitations Organellar variability, influence reconstruction, Sibogagorgia cauliflora kaupeka differently explains why S3 S4) deduced application [Biomatters Inc.] RADOrgMiner). three clades general concordant (2019). Moreover, hybrid authors. rate technical replicates—that analysis—suggests accurate method presented highlights potential pipeline: smaller cases. uneven stem wet-lab (SE Illumina NextSeq). congruent S6 S7) Storer (2017). proves ancient events detected supplementing SNPs Paetzold

Journal

Journal title: Methods in Ecology and Evolution

Year: 2022

ISSN: 2041-210X

DOI: https://doi.org/10.1111/2041-210x.13937